In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker-independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT for single-channel speaker-independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on the generalizability and noise robustness of models that rely on various types of a priori knowledge, e.g., the noise type and the number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure on the speaker-independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). In particular, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements when evaluated using known noise types, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.
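The core idea behind uPIT is that the speaker-to-output assignment is not known in advance, so the training loss is computed under the best permutation of network outputs and reference signals, with a single assignment held fixed over the whole utterance. A minimal sketch of that loss, assuming a magnitude-domain MSE criterion and illustrative NumPy array shapes (neither is prescribed by the abstract), could look as follows:

```python
# Hypothetical sketch of the utterance-level PIT (uPIT) loss.
# Shapes, the MSE criterion, and NumPy are illustrative assumptions,
# not details taken from the paper itself.
from itertools import permutations

import numpy as np


def upit_loss(estimates, references):
    """Utterance-level PIT loss.

    estimates, references: arrays of shape (num_speakers, num_frames, num_bins).
    Returns the MSE under the single best speaker permutation, evaluated
    over the entire utterance (not per frame).
    """
    n = estimates.shape[0]
    best = np.inf
    for perm in permutations(range(n)):
        # Average error over the whole utterance for this speaker assignment.
        err = np.mean((estimates[list(perm)] - references) ** 2)
        best = min(best, err)
    return best


# Toy usage: feeding the reference sources back in swapped speaker order
# still yields zero loss, because uPIT searches over all permutations.
refs = np.random.rand(2, 100, 257)
swapped = refs[::-1]
print(upit_loss(swapped, refs))
```

Evaluating the permutation once per utterance, rather than per frame, is what keeps each output stream assigned to one speaker throughout the signal; the search is factorial in the number of speakers, which is cheap for the two- and three-speaker mixtures considered here.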